stochastic models such as hidden Markov models (Sean R Eddy 2004) allow hidden system states (e.g. exon, intron) to be predicted from a sequence of observations (e.g. ATCCCTG...) using a Markov chain (Bayesian network; supervised machine learning). Hidden Markov models are widely used for genome annotation (exon-intron regions; e.g. the GENSCAN program), but also for protein domain prediction (e.g. the Pfam, SMART and InterPro databases and the HMMER software) and network regulation (e.g. signal peptides; SignalP and TMHMM programs). In addition, numerous specialized programs detect RNA sequences (e.g. Rfam, tRNAscan), viral sequences, repeat regions (e.g. RepeatMasker) and other sites in the genome (e.g. enhancers, miRNAs, lncRNAs) and label them accordingly.
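
How a hidden Markov model assigns hidden states to a sequence can be illustrated with a minimal sketch of the Viterbi algorithm. The two states, all probabilities and the function name viterbi are assumptions chosen purely for illustration; real gene finders such as GENSCAN use far richer, trained models.

```python
import math

# Hypothetical toy parameters, made up for illustration (not trained values).
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
# Assumption: exons are slightly GC-rich, introns slightly AT-rich.
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(seq):
    """Return the most probable hidden-state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    # scores[s]: best log-probability of any path that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS[prev][s])
                             + math.log(EMIT[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi("GCGCGCGCGCATATATATAT"))
```

With these toy values, a GC-rich stretch followed by an AT-rich stretch is labelled as exon and then intron; the high probability of staying in the same state keeps the path from switching at every single base.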

An important step is also to take a closer look at the promoter. Transcription factors bind to DNA sequence motifs (Patrik D’haeseleer 2006) in the promoter (so-called transcription factor binding sites, TFBS) and thus regulate gene expression (transcription). These conserved DNA patterns, usually consisting of 8–20 nucleotides, can be recognized computationally by binding site pattern recognition algorithms based on experimental data, such as chromatin immunoprecipitation followed by DNA sequencing (ChIP-seq). A distinction is made between probabilistic (binding sites; position weight matrix), discriminant (sites + non-functional sites) and energy (sites + binding free energy) TFBS models (Stormo 2010, 2013). Databases such as TRANSFAC and JASPAR contain the TFBS matrices for different organisms. These can be used, for example, to search a sequence for TFBS in order to understand gene expression (e.g. with the MotifMap, Alggen Promo or TESS programs), but also to find possible regulation via modular TFBS (TF modules) (e.g. using the Genomatix program). In addition, ab initio approaches (e.g. MEME Suite and iRegulon) try to find recurrent sequence patterns in multiple sequences via multiple alignment, which are then compared to known TFBS motifs for similarity. For example, we showed in one paper that the heart failure-associated lncRNA Chast is regulated by promoter binding of Nfat4 (Viereck et al. 2016).
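
The probabilistic TFBS model mentioned above, the position weight matrix, can be sketched in a few lines. The aligned sites, the pseudocount, the uniform background of 0.25 per base, the threshold and the names build_pwm and scan are all assumptions made for illustration; they are not taken from TRANSFAC, JASPAR or any of the programs named above.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        # Log-odds of the smoothed base frequency against the background
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for every window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Hypothetical aligned binding sites, made up for illustration only.
SITES = ["TGACTCAG", "TGAGTCAG", "TGACTCAT", "TGACTAAG", "TGAGTCAG"]
PWM = build_pwm(SITES)
# Hypothetical promoter fragment with two planted sites.
HITS = scan("CCGTTGACTCAGGATTGAGTCAGTT", PWM, threshold=8.0)

if __name__ == "__main__":
    for start, window, score in HITS:
        print(start, window, round(score, 2))
```

In this toy example the scan reports the two planted sites at 0-based positions 4 and 15. In practice the threshold would be calibrated against background sequence rather than set by hand, and both DNA strands would be scanned.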

In this way, from 1995 onwards (with E. coli and yeast), the first genomes began to be completely annotated and published. This was followed by the genomes of eukaryotes (cells with a cell nucleus), which were about a thousand times larger, in particular that of humans (2001) and many other higher organisms (fly, mosquito, mouse, rat, chimpanzee, chicken, fish, etc.).

Another aspect is then to assemble the encoded proteins, RNAs and elements into higher-level networks. For example, a single enzyme does not stand alone, but is part of a metabolic network (see next chapter). In the same way, a transcription factor that binds to the promoter of a gene does not stand alone, but is part of the overall regulation (so-called regulatory networks, see the chapter after next). The precise description of individual genes often requires not only DNA but also RNA data (the “transcriptome”), in particular in order to precisely determine the start and end of the segments transcribed into RNA. An integrative analysis yields the most accurate results here, even in the case of viruses with their compact genomes (Whisnant et al. 2020).

3.1  Sequencing Genomes: Spelling Genomes